Using eBird trends and status abundance data, can we find a minimum number of city total checklists required, and a cut off percentage of checklists recorded, such that we can be confident that any species recorded on more than that percentage of checklist is present in the city (given the city has more that the total minimum total checklists recorded within it).
source('../env.R')
options(warn=-1)
taxonomic_mapping = read_csv(filename(TAXONOMY_OUTPUT_DIR, 'taxonomy_mapping.csv'))
Rows: 444 Columns: 8── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_name, birdlife_common_name, ebird_common_name, ebird_species_fullname, ebird_species_name, jetz_species_name
dbl (2): birdlife_id, ebird_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Find sample size, and percentage of checklists (or percentage of effort) for valid urban communities based on presence test data.
This data contains the mean and median abundance across three different seasons for each city (breeding, nonbreeding, resident). The data comes from the eBird status and trends data sets and is calculated from all the pixels that occur within a city polygon for each species.
abundance_data = read_csv(filename(EBIRD_WORKING_OUTPUT_DIR, 'ebird_trends_abundance_test_data.csv'))
Rows: 30876 Columns: 13── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): species, city_name
dbl (11): ID, mean_breeding, median_breeding, sd_breeding, mean_nonbreeding, median_nonbreeding, sd_nonbreeding, mean_resident, median_resident, sd_resident, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(abundance_data)
These are the species available for our test
test_species = unique(abundance_data$species)
test_species
[1] "Geopelia humeralis" "Patagioenas fasciata" "Macropygia phasianella" "Patagioenas araucana" "Columbina passerina"
[6] "Columba palumbus" "Ocyphaps lophotes" "Zenaida auriculata" "Streptopelia decaocto" "Columbina inca"
[11] "Zenaida macroura" "Hemiphaga novaeseelandiae" "Streptopelia orientalis" "Patagioenas picazuro" "Columbina picui"
[16] "Patagioenas flavirostris" "Streptopelia tranquebarica" "Streptopelia semitorquata" "Streptopelia capicola" "Columba livia"
[21] "Columbina talpacoti" "Patagioenas squamosa" "Patagioenas nigrirostris" "Columba guinea" "Spilopelia chinensis"
[26] "Zenaida meloda" "Patagioenas leucocephala" "Leptotila verreauxi" "Zenaida asiatica" "Treron phoenicopterus"
[31] "Zenaida aurita"
test_species_mapping = unique(taxonomic_mapping[taxonomic_mapping$ebird_species_name %in% test_species, c('ebird_species_name', 'jetz_species_name')])
test_species_mapping[order(test_species_mapping$ebird_species_name),] %>% arrange(jetz_species_name)
abundance_data_jetz = abundance_data %>% left_join(test_species_mapping, by=c('species' = 'ebird_species_name')) %>% dplyr::select(-c('species'))
abundance_data_jetz
This data is created from Birdlife (and some eBird) distrbution data, a species is recorded as occuring in a regional pool when its range polygon intersects with a city polygon.
A set of city names we will use for inspecting data:
inspection_cities = c('Manchester', 'Bogota', 'Los Angeles', 'Jakarta', 'Nairobi', 'Medellín', 'San Jose')
Load data
pool_data = read_csv(filename(COMMUNITY_OUTPUT_DIR, 'jetz_all_recorded_species.csv'))
Rows: 6359 Columns: 15── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): city_name, jetz_species_name, seasonal, presence, origin
dbl (10): city_id, total_city_checklists, total_city_locations, total_city_effort, total_presence_checklists, total_presence_effort, total_presence_locations,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pool_data
Map taxonomy to eBird
regional_pool_data = pool_data %>%
filter(jetz_species_name %in% test_species_mapping$jetz_species_name) %>%
left_join(test_species_mapping) %>%
arrange(city_id, jetz_species_name)
Joining, by = "jetz_species_name"
Final regional pool data
regional_pool_data[regional_pool_data$city_name %in% inspection_cities,]
The abundance data includes a record for every species in every city, even though many of those values will be 0 or NA. Here we filter out those erroneous rows by joining the abundance data to our regional pool data. This will ensure we only have test data (an abundance score) for species that we expect to occur within a city.
abundance_data_regional = left_join(regional_pool_data, abundance_data_jetz, by = c('city_id' = 'city_id', 'city_name' = 'city_name', 'jetz_species_name' = 'jetz_species_name'))
abundance_data_regional = abundance_data_regional %>% arrange(city_id) %>%
dplyr::select(-c('ID'))
Our new abundance data:
abundance_data_regional[abundance_data_regional$city_name %in% inspection_cities,]
Check to see if we have filtered out any species that should cause us concern. This list contains all species that have a non-zero abundance but have now been removed from our data (e.g. they do not occur within the regional pool of the city, or more specifically their birdlife/ebird range does not overlap the city vector). We can fix this by seeking alternative ways of defining the regional pools, e.g. by appending the eBird status and trends ranges for a species.
city_id_species_pairs_present = paste(abundance_data_regional$jetz_species_name, abundance_data_regional$city_id, sep = '::')
abundance_data_jetz %>%
filter(mean_breeding > 0 | mean_nonbreeding > 0 | mean_resident > 0) %>%
group_by(city_id, jetz_species_name) %>%
filter(!(paste(jetz_species_name, city_id, sep = '::') %in% city_id_species_pairs_present)) %>%
dplyr::select(city_id, city_name, jetz_species_name, mean_breeding, mean_nonbreeding, mean_resident)
plot_city_species = function(city_id) {
city_row = abundance_data_regional[abundance_data_regional$city_id == city_id, c('city_name', 'total_city_checklists')]
ggplot(abundance_data_regional[abundance_data_regional$city_id == city_id,], aes(x = jetz_species_name, y = percentage_checklists, fill = seasonal)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste('breeding', round(mean_breeding, 1), '\nresident', round(mean_resident, 1)))) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
xlab('Species name') + ylab('% of Checklist present') +
facet_wrap(~city_id) +
geom_hline(yintercept=5, linetype='dotted', col = 'red') +
labs(title = paste(city_row$city_name, city_row$total_city_checklists))
}
Here we show the city and total number of checklists. Then for each species plotted against percentage of checklists, the breeding and resident abundance.
plot_city_species(14)
plot_city_species(624)
plot_city_species(1786)
plot_city_species(4830)
plot_city_species(11920)
Each species is recorded on checklists, within each city we have found the percentage of all checklists within a city that a species is recorded on. A city with a low number of total checklists is likely to have eratic and unreliable values for species percentage of checklists. Here we try to find out what is the minimum number of checklists for a city to have, for percentage of total checklists to map to abundance.
ggplot(abundance_data_regional[abundance_data_regional$jetz_species_name == 'Columba_palumbus',], aes(y = percentage_checklists, x = mean_breeding, color = log(total_city_checklists))) + geom_jitter() + geom_jitter() + scale_colour_gradient2(
low = "red",
mid = "yellow",
high = "darkgreen",
midpoint = log(100),
space = "Lab",
na.value = "grey50",
guide = "colourbar",
aesthetics = "colour",
) +
xlab('Mean breeding abundance') + ylab('Percentage of city checklists') +
labs(title = 'Woodpigeon data across all cities', color = 'Log of total\nnumber of\ncity checklists')
ggplot(abundance_data_regional, aes(y = percentage_checklists, x = sqrt(mean_breeding), color = log(total_city_checklists))) + geom_jitter() + geom_jitter() + scale_colour_gradient2(
low = "red",
mid = "yellow",
high = "darkgreen",
midpoint = log(100),
space = "Lab",
na.value = "grey50",
guide = "colourbar",
aesthetics = "colour",
) + geom_smooth() +
xlab('Sqrt of Mean breeding abundance') + ylab('Percentage of city checklists') +
labs(title = 'All species data across all cities', color = 'Log of total\nnumber of\ncity checklists')
Outliers, where sqrt of breeding abundance greater than 3.5
abundance_data_regional[!is.na(abundance_data_regional$mean_breeding) & sqrt(abundance_data_regional$mean_breeding) > 3.5,]
ggplot(abundance_data_regional, aes(y = percentage_checklists, x = sqrt(mean_resident), color = log(total_city_checklists))) + geom_jitter() + geom_jitter() + scale_colour_gradient2(
low = "red",
mid = "yellow",
high = "darkgreen",
midpoint = log(100),
space = "Lab",
na.value = "grey50",
guide = "colourbar",
aesthetics = "colour",
) + geom_smooth() +
xlab('Sqrt of Mean resident abundance') + ylab('Percentage of city checklists') +
labs(title = 'All species data across all cities', color = 'Log of total\nnumber of\ncity checklists')
Outliers, where sqrt of resident abundance greater than 5
abundance_data_regional[!is.na(abundance_data_regional$mean_resident) & sqrt(abundance_data_regional$mean_resident) > 5,]
max_abundance is the maximum of breeding, nonbreeding, or resident abundance.
Plot that against percentage of checklists, remove outliers where sqrt of max_abundance greater than 5.
ggplot(abundance_data_regional[!is.na(abundance_data_regional$max_mean_abundance) & sqrt(abundance_data_regional$max_mean_abundance) < 5,], aes(y = percentage_checklists, x = sqrt(max_mean_abundance), color = log(total_city_checklists))) + geom_jitter() + geom_jitter() + scale_colour_gradient2(
low = "red",
mid = "yellow",
high = "darkgreen",
midpoint = log(100),
space = "Lab",
na.value = "grey50",
guide = "colourbar",
aesthetics = "colour",
) + geom_smooth() +
xlab('Sqrt of Max mean abundance') + ylab('Percentage of city checklists') +
labs(title = 'All species data across all cities', subtitle='Sqrt of max mean abundance < 5', color = 'Log of total\nnumber of\ncity checklists')
Create trimmed dataset, removing all species with a sqrt of max_abundance greater than 5.
trimmed_data = abundance_data_regional[!is.na(abundance_data_regional$max_mean_abundance) & sqrt(abundance_data_regional$max_mean_abundance) < 5,]
Here we try to predict abundance from the percentage of checklists a species occurs on. We try to find a cut off for the minimum total number of checklists recorded in a city to improve the prediction. For each test we split the cities 50/50 into training and test data, and run the test 20 times for each cut off.
seeds = c(123, 456, 678, 10, 11, 345, 32, 11, 54, 90, 9999, 1234, 5678, 2323, 9011, 532, 111, 678, 6501, 3)
# min_city_checklists is the minimum number of cities a checklist must have to be included.
# get_model is a function that takes a dataframe subset of `abundance_data_regional`, and returns the result of `lm`
# get_actual is a function that returns the expected result of the prediction given a dataframe subset of `abundance_data_regional`
get_mean_test_error = function(min_city_checklists, get_model, get_expected) {
data = trimmed_data[trimmed_data$total_city_checklists >= min_city_checklists,]
result = data.frame()
for(seed in seeds) {
set.seed(seed)
train = sample(1:nrow(data), nrow(data)/2)
test=(-train)
model = get_model(data[train,])
prediction = predict(model, data[test,])
result = rbind(result, data.frame(prediction = prediction, actual = get_expected(data[test,])))
}
sum((result$prediction - result$actual)^2) / nrow(result)
}
min_city_checklists = seq(0, 5000, by=10)
result_abundance = data.frame()
for (min_city_checklist in min_city_checklists) {
result_abundance = rbind(result_abundance, data.frame(
min_city_checklist = min_city_checklist,
mean_test_error = get_mean_test_error(
min_city_checklist,
function(training_data) {lm(max_mean_abundance ~ percentage_checklists, training_data)},
function(test_data) {test_data$max_mean_abundance})
))
}
ggplot(result_abundance, aes(x = min_city_checklist, y = mean_test_error)) + geom_line() +
xlab('Minimum total number of city checklists') + ylab('Mean test error for all 20 tests') +
labs(title = 'Predicting max abundance of a species in a city from the percentage of checklists')
Here we repeat the above test, but instead try to predict the sqrt of max abundance.
result_sqrt_abundance = data.frame()
for (min_city_checklist in min_city_checklists) {
result_sqrt_abundance = rbind(result_sqrt_abundance, data.frame(
min_city_checklist = min_city_checklist,
mean_test_error = get_mean_test_error(
min_city_checklist,
function(training_data) {lm(sqrt(max_mean_abundance) ~ percentage_checklists, training_data)},
function(test_data) {sqrt(test_data$max_mean_abundance)})
))
}
ggplot(result_sqrt_abundance, aes(x = min_city_checklist, y = mean_test_error)) + geom_line() +
xlab('Minimum total number of city checklists') + ylab('Mean test error for all 20 tests') +
labs(title = 'Predicting sqrt max abundance from the percentage of checklists')
Here we try a simpler test of whether a species is present or not. Set present as having whether a species has a max_mean_abundance greater than 0
trimmed_data$nonzero_abundance = trimmed_data$max_mean_abundance > 0
Bin data into 25 bins based on number of total city checklists. Boxplot percentage of checklists based on whether species are present (i.e. max_mean_abundance greater than 0)
trimmed_data$total_city_checklists_bin = cut(trimmed_data$total_city_checklists, breaks = 25, labels = F)
trimmed_data$total_city_checklists_bin = as.factor(trimmed_data$total_city_checklists_bin)
ggplot(trimmed_data, aes(x = total_city_checklists_bin, y = percentage_checklists)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + facet_wrap(~ nonzero_abundance) + xlab('Total city checklists group') + ylab('Percentage of city checklists') +
labs(title = 'Boxplot of percentage of city checklists\ngiven whether the species is considered present')
Here we can see that there is a prediction problem with species considered present that have a very low percentage of recorded checklists.
get_true_false_test_results = function(min_city_checklists, df = trimmed_data) {
data = df[df$total_city_checklists >= min_city_checklists,]
result = data.frame()
coeeffs = c()
for(seed in seeds) {
train = sample(1:nrow(data), nrow(data) / 2)
test=(-train)
model = glm(nonzero_abundance ~ percentage_checklists, "binomial", data[train,])
prediction = predict(model, data[test,], type = "response")
result = rbind(result, data.frame(actual = data$nonzero_abundance[test], predicted = prediction))
coeeffs = append(coeeffs, model$coefficients['percentage_checklists'])
}
data.frame(
mean_false_probability = mean(result$predicted[!result$actual]),
mean_true_probablility = mean(result$predicted[result$actual]),
percentage_checklists_cutoff = mean(coeeffs),
min_city_checklists = min_city_checklists
)
}
Try predicting species presence (i.e. max_mean_abundance > 0) based on the percentage of checklists a species is recorded on. For each run we predict the result in our test data, and then find the mean prediction for true vs false.
ggplot(result_present, aes(x = min_city_checklists, y = mean_false_probability)) + geom_line() +
xlab('Min total city checklists') + ylab('Mean presence probablity, given species absent')
Some results at different minimum total city checklists, the test estimate gives us a percentage of checklists cut off for whether a species is present. However, we can see that the mean probabilty for present remains high for species that are absent.
result_present[result_present$min_city_checklists %in% c(100, 500, 1000, 1500, 2000),]
Take a look at those records that are expected present, but have low percentage of checklists recorded. Is there any pattern here?
problem_rows = trimmed_data[trimmed_data$nonzero_abundance & trimmed_data$percentage_checklists < 2.5 & trimmed_data$total_city_checklists > 100,
c('city_name', 'jetz_species_name', 'total_city_checklists', 'mean_breeding', 'median_breeding', 'mean_resident', 'median_resident', 'mean_nonbreeding', 'median_nonbreeding', 'seasonal', 'percentage_checklists', 'city_id')]
problem_rows
Any particular species?
problem_rows %>% group_by(jetz_species_name) %>% summarise(
total_problem_records = n(),
mean_mean_breeding = mean(mean_breeding),
max_mean_breeding = max(mean_breeding),
mean_median_breeding = mean(median_breeding),
max_median_breeding = max(median_breeding),
mean_mean_nonbreeding = mean(mean_nonbreeding),
max_mean_nonbreeding = max(mean_nonbreeding),
mean_median_nonbreeding = mean(median_nonbreeding),
max_median_nonbreeding = max(median_nonbreeding),
mean_mean_resident = mean(mean_resident),
max_mean_resident = max(mean_resident),
mean_median_resident = mean(median_resident),
max_median_resident = max(median_resident)
) %>% arrange(total_problem_records)
Any particular geographical regions?
Reading layer `WB_countries_Admin0_10m' from data source `/Users/james/Dropbox/PhD/WorldBank_countries_Admin0_10m/WB_countries_Admin0_10m.shp' using driver `ESRI Shapefile'
Simple feature collection with 251 features and 52 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -180 ymin: -59.47275 xmax: 180 ymax: 83.6341
Geodetic CRS: WGS 84
Reading layer `initial_selection' from data source `/Users/james/Dropbox/PhD/urban_community_structure_wrk/geo/cities/initial_selection.shp' using driver `ESRI Shapefile'
Simple feature collection with 996 features and 4 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -158.043 ymin: -38.20685 xmax: 174.9891 ymax: 60.3276
Geodetic CRS: WGS 84
Joining, by = c("city_name", "city_id")`summarise()` has grouped output by 'city_id', 'city_name'. You can override using the `.groups` argument.
Exclude
problem_species_to_exclude = c('Streptopelia_orientalis', 'Leptotila_verreauxi', 'Streptopelia_decaocto', 'Streptopelia_tranquebarica')
ggplot(result_present_exclude_species, aes(x = min_city_checklists, y = mean_false_probability)) + geom_line() +
xlab('Min total city checklists') + ylab('Mean presence probablity, given species absent')
result_present_exclude_species[result_present_exclude_species$min_city_checklists %in% c(100, 500, 1000, 1500, 2000),]
Exclude cities in Southern Asia and Central America
cities_joined_to_world = st_join(initial_city_selection, world_map)
although coordinates are longitude/latitude, st_intersects assumes that they are planar
problem_cities_to_exclude = cities_joined_to_world$city_id[cities_joined_to_world$SUBREGION %in% c('Southern Asia', 'Central America')]
ggplot(result_present_exclude_cities, aes(x = min_city_checklists, y = mean_false_probability)) + geom_line() +
xlab('Min total city checklists') + ylab('Mean presence probablity, given species absent')
result_present_exclude_cities[result_present_exclude_cities$min_city_checklists %in% c(100, 500, 1000, 1500, 2000),]
min(result_present_exclude_cities$min_city_checklists[result_present_exclude_cities$mean_false_probability < 0.5])
[1] 350
round(result_present_exclude_cities$percentage_checklists_cutoff[result_present_exclude_cities$min_city_checklists == 350], 1)
[1] 3.6
If we’re happy that we can exclude South American and South Asian cities due to poor eBird sampling data leading to their models producing invalid abundance. Then selecting the first test that produces an average true probability under 0.5 for absent species, means we should set the minimum required number of checklists within a city to 380, and the percentage cut-off for a species being present as 3.6%.
The number of cities in the analysis is then:
length(unique(trimmed_data$city_id[trimmed_data$total_city_checklists >= 350]))
[1] 297
Distribution of those cities
selected_cities_for_study = unique(trimmed_data$city_id[trimmed_data$total_city_checklists >= 350])
selected_city_centres = st_centroid(initial_city_selection[initial_city_selection$city_id %in% selected_cities_for_study,])
Warning: st_centroid assumes attributes are constant over geometries of xWarning: st_centroid does not give correct centroids for longitude/latitude data
ggplot() +
geom_sf(data = world_map, aes(geometry = geometry)) +
geom_sf(data = selected_city_centres, aes(geometry = geometry), colour = "red")